This dataset is publicly available for research. The details are described in [Cortez et al., 2009].
P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553. ISSN: 0167-9236.
Available at: [@Elsevier] http://dx.doi.org/10.1016/j.dss.2009.05.016 [Pre-press (pdf)] http://www3.dsi.uminho.pt/pcortez/winequality09.pdf [bib] http://www3.dsi.uminho.pt/pcortez/dss09.bib
## 'data.frame': 1599 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
## 'data.frame': 1599 obs. of 14 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : Ord.factor w/ 6 levels "3"<"4"<"5"<"6"<..: 3 3 3 4 3 3 3 5 5 3 ...
## $ wine.grade : chr "B" "B" "B" "B" ...
## [1] 217
## [1] 1319
## [1] 63
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.40 9.50 10.20 10.42 11.10 14.90
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.740 3.210 3.310 3.311 3.400 4.010
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9901 0.9956 0.9968 0.9967 0.9978 1.0037
## [1] "Sulphates"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.3300 0.5500 0.6200 0.6581 0.7300 2.0000
## [1] "Free sulfur dioxide"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 7.00 14.00 15.87 21.00 72.00
## [1] "Total sulfur dioxide (g / dm^3)"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 6.00 22.00 38.00 46.47 62.00 289.00
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.900 1.900 2.200 2.539 2.600 15.500
## [1] "Citric acid"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.090 0.260 0.271 0.420 1.000
## [1] "Fixed acidity"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.60 7.10 7.90 8.32 9.20 15.90
## X fixed.acidity volatile.acidity citric.acid
## Min. : 8.0 Min. : 4.900 Min. :0.1200 Min. :0.0000
## 1st Qu.: 482.0 1st Qu.: 7.400 1st Qu.:0.3000 1st Qu.:0.3000
## Median : 939.0 Median : 8.700 Median :0.3700 Median :0.4000
## Mean : 831.7 Mean : 8.847 Mean :0.4055 Mean :0.3765
## 3rd Qu.:1089.0 3rd Qu.:10.100 3rd Qu.:0.4900 3rd Qu.:0.4900
## Max. :1585.0 Max. :15.600 Max. :0.9150 Max. :0.7600
## residual.sugar chlorides free.sulfur.dioxide
## Min. :1.200 Min. :0.01200 Min. : 3.00
## 1st Qu.:2.000 1st Qu.:0.06200 1st Qu.: 6.00
## Median :2.300 Median :0.07300 Median :11.00
## Mean :2.709 Mean :0.07591 Mean :13.98
## 3rd Qu.:2.700 3rd Qu.:0.08500 3rd Qu.:18.00
## Max. :8.900 Max. :0.35800 Max. :54.00
## total.sulfur.dioxide density pH sulphates
## Min. : 7.00 Min. :0.9906 Min. :2.880 Min. :0.3900
## 1st Qu.: 17.00 1st Qu.:0.9947 1st Qu.:3.200 1st Qu.:0.6500
## Median : 27.00 Median :0.9957 Median :3.270 Median :0.7400
## Mean : 34.89 Mean :0.9960 Mean :3.289 Mean :0.7435
## 3rd Qu.: 43.00 3rd Qu.:0.9973 3rd Qu.:3.380 3rd Qu.:0.8200
## Max. :289.00 Max. :1.0032 Max. :3.780 Max. :1.3600
## alcohol quality wine.grade
## Min. : 9.20 3: 0 Length:217
## 1st Qu.:10.80 4: 0 Class :character
## Median :11.60 5: 0 Mode :character
## Mean :11.52 6: 0
## 3rd Qu.:12.20 7:199
## Max. :14.00 8: 18
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1.0 Min. : 4.700 Min. :0.1600 Min. :0.0000
## 1st Qu.: 382.5 1st Qu.: 7.100 1st Qu.:0.4100 1st Qu.:0.0900
## Median : 768.0 Median : 7.800 Median :0.5400 Median :0.2400
## Mean : 793.0 Mean : 8.254 Mean :0.5386 Mean :0.2583
## 3rd Qu.:1219.5 3rd Qu.: 9.100 3rd Qu.:0.6400 3rd Qu.:0.4000
## Max. :1599.0 Max. :15.900 Max. :1.3300 Max. :0.7900
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.900 Min. :0.03400 Min. : 1.00
## 1st Qu.: 1.900 1st Qu.:0.07100 1st Qu.: 8.00
## Median : 2.200 Median :0.08000 Median :14.00
## Mean : 2.504 Mean :0.08897 Mean :16.37
## 3rd Qu.: 2.600 3rd Qu.:0.09100 3rd Qu.:22.00
## Max. :15.500 Max. :0.61100 Max. :72.00
## total.sulfur.dioxide density pH sulphates
## Min. : 6.00 Min. :0.9901 Min. :2.860 Min. :0.3700
## 1st Qu.: 24.00 1st Qu.:0.9958 1st Qu.:3.210 1st Qu.:0.5400
## Median : 40.00 Median :0.9968 Median :3.310 Median :0.6100
## Mean : 48.95 Mean :0.9969 Mean :3.311 Mean :0.6473
## 3rd Qu.: 65.00 3rd Qu.:0.9979 3rd Qu.:3.400 3rd Qu.:0.7000
## Max. :165.00 Max. :1.0037 Max. :4.010 Max. :1.9800
## alcohol quality wine.grade
## Min. : 8.40 3: 0 Length:1319
## 1st Qu.: 9.50 4: 0 Class :character
## Median :10.00 5:681 Mode :character
## Mean :10.25 6:638
## 3rd Qu.:10.90 7: 0
## Max. :14.90 8: 0
## X fixed.acidity volatile.acidity citric.acid
## Min. : 19.0 Min. : 4.600 Min. :0.2300 Min. :0.0000
## 1st Qu.: 435.0 1st Qu.: 6.800 1st Qu.:0.5650 1st Qu.:0.0200
## Median : 834.0 Median : 7.500 Median :0.6800 Median :0.0800
## Mean : 837.7 Mean : 7.871 Mean :0.7242 Mean :0.1737
## 3rd Qu.:1285.5 3rd Qu.: 8.400 3rd Qu.:0.8825 3rd Qu.:0.2700
## Max. :1522.0 Max. :12.500 Max. :1.5800 Max. :1.0000
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 1.200 Min. :0.04500 Min. : 3.00
## 1st Qu.: 1.900 1st Qu.:0.06850 1st Qu.: 5.00
## Median : 2.100 Median :0.08000 Median : 9.00
## Mean : 2.685 Mean :0.09573 Mean :12.06
## 3rd Qu.: 2.950 3rd Qu.:0.09450 3rd Qu.:15.50
## Max. :12.900 Max. :0.61000 Max. :41.00
## total.sulfur.dioxide density pH sulphates
## Min. : 7.00 Min. :0.9934 Min. :2.740 Min. :0.3300
## 1st Qu.: 13.50 1st Qu.:0.9957 1st Qu.:3.300 1st Qu.:0.4950
## Median : 26.00 Median :0.9966 Median :3.380 Median :0.5600
## Mean : 34.44 Mean :0.9967 Mean :3.384 Mean :0.5922
## 3rd Qu.: 48.00 3rd Qu.:0.9977 3rd Qu.:3.500 3rd Qu.:0.6000
## Max. :119.00 Max. :1.0010 Max. :3.900 Max. :2.0000
## alcohol quality wine.grade
## Min. : 8.40 3:10 Length:63
## 1st Qu.: 9.60 4:53 Class :character
## Median :10.00 5: 0 Mode :character
## Mean :10.22 6: 0
## 3rd Qu.:11.00 7: 0
## Max. :13.10 8: 0
The dataset contains 1,599 observations of 13 variables. I added one categorical variable (wine.grade), which brings the total to 14.
- Studying the variation of pH in this wine sample.
- The relationship between the percentage of alcohol and the resulting quality of wine.
I suspect that sulphates and the pH index have a deep impact on quality.
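Yes, I created the wine.grade variable, which groups quality into three grades: quality 3-4 (63 wines), 5-6 (1,319 wines), and 7-8 (217 wines).
I performed some transformations and took some quantiles to make the graphs easier to read, but overall the data is tidy.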
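A minimal sketch of how a grade variable like this could be derived, assuming quality 3-4, 5-6, and 7-8 form the three bins. Only the "B" label for quality 5-6 is visible in the output above; assigning "A" to the top bin and "C" to the bottom one is an assumption, and the short `quality` vector is illustrative rather than the full dataset.

```r
# Illustrative quality scores (not the full dataset).
quality <- c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5, 3, 4, 8)

# Ordered factor with the six levels seen in the str() output: 3 < 4 < ... < 8.
quality_ord <- factor(quality, levels = 3:8, ordered = TRUE)

# Bin the scores into (2,4], (4,6], (6,8]; the letter labels are assumed.
wine_grade <- as.character(
  cut(quality, breaks = c(2, 4, 6, 8), labels = c("C", "B", "A"))
)

# Count wines per grade.
print(table(wine_grade))
```

Storing the grade as a character vector matches the `chr` type shown for wine.grade; a factor would work equally well for plotting.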
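One plausible reading of the quantile step is trimming a long right tail before plotting. The sketch below assumes a 90% cutoff (my choice, not stated in the analysis) and uses the first ten total.sulfur.dioxide values from the structure output above.

```r
# First ten total.sulfur.dioxide values, copied from the str() output.
total_so2 <- c(34, 67, 54, 60, 34, 40, 59, 21, 18, 102)

# Upper cutoff at the 90th percentile (assumed threshold).
cutoff <- quantile(total_so2, probs = 0.90)

# Keep only values at or below the cutoff so the tail does not dominate the plot.
trimmed <- total_so2[total_so2 <= cutoff]

print(cutoff)
print(trimmed)
```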
There’s a negative correlation between Quality and Volatile Acidity.
There’s a negative correlation between pH and both citric acid and fixed acidity.
##
## Pearson's product-moment correlation
##
## data: redwine$alcohol and redwine$quality
## t = 21.639, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.4373540 0.5132081
## sample estimates:
## cor
## 0.4761663
##
## Pearson's product-moment correlation
##
## data: alcohol_above_13$quality and alcohol_above_13$alcohol
## t = -0.39861, df = 21, p-value = 0.6942
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.4816540 0.3376049
## sample estimates:
## cor
## -0.08665653
##
## Pearson's product-moment correlation
##
## data: redwine$quality and redwine$sulphates
## t = 10.38, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.2049011 0.2967610
## sample estimates:
## cor
## 0.2513971
##
## Pearson's product-moment correlation
##
## data: redwine$quality and redwine$citric.acid
## t = 9.2875, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.1793415 0.2723711
## sample estimates:
## cor
## 0.2263725
##
## Pearson's product-moment correlation
##
## data: redwine$pH and redwine$citric.acid
## t = -25.767, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.5756337 -0.5063336
## sample estimates:
## cor
## -0.5419041
+ pH and Fixed Acidity
##
## Pearson's product-moment correlation
##
## data: redwine$pH and redwine$fixed.acidity
## t = -37.366, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.7082857 -0.6559174
## sample estimates:
## cor
## -0.6829782
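- pH versus fixed acidity and citric acid: as expected, pH is negatively correlated with both (r = -0.68 and r = -0.54, respectively).
- Alcohol & Quality:
  - Above 13% alcohol: the correlation with quality is slightly negative (r = -0.09), but with only 23 wines it is not statistically significant (p = 0.69).
  - Under 13% alcohol: quality tends to increase with alcohol (the full-sample correlation is r = 0.48).
- Based on the sample data, wine quality tends to increase as sulphates and citric acid content increase, though both correlations are weak (r = 0.25 and r = 0.23).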
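The alcohol split described above can be sketched as follows. The data frame holds made-up toy values chosen only to show the mechanics, and the names `wines`, `above_13`, and `below_13` are hypothetical.

```r
# Toy values for illustration only (not taken from the dataset).
wines <- data.frame(
  alcohol = c(9.4, 10.5, 11.2, 12.8, 13.1, 13.4, 14.0),
  quality = c(5, 5, 6, 7, 7, 6, 6)
)

# Split the sample at the 13% alcohol threshold.
above_13 <- subset(wines, alcohol > 13)
below_13 <- subset(wines, alcohol <= 13)

# Pearson correlation between alcohol and quality within each subset.
cor_below <- cor(below_13$alcohol, below_13$quality)
cor_above <- cor(above_13$alcohol, above_13$quality)

print(c(below = cor_below, above = cor_above))
```

With very few rows in a subset, the sign of the correlation is fragile, which is why the p-value of the corresponding test matters as much as the estimate.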
Yes, there are some relationships that aren’t part of this analysis, such as the one between free sulfur dioxide and total sulfur dioxide.
The strongest relationship is between fixed acidity and pH, with a correlation of -0.68 (a strong negative correlation); this was expected.
##
## Pearson's product-moment correlation
##
## data: redwine$citric.acid and redwine$fixed.acidity
## t = 36.234, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.6438839 0.6977493
## sample estimates:
## cor
## 0.6717034
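The correlation tests reported above come from `cor.test()`. As a small sketch, the same call is applied here to only the first ten fixed.acidity and pH values listed in the structure output, so the numbers will differ from the full-sample results.

```r
# First ten rows, copied from the str() output above.
fixed_acidity <- c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5)
ph            <- c(3.51, 3.2, 3.26, 3.16, 3.51, 3.51, 3.3, 3.39, 3.36, 3.35)

# Pearson's product-moment correlation test (the default method).
res <- cor.test(ph, fixed_acidity)

# Estimate, degrees of freedom (n - 2), and p-value.
print(res$estimate)
print(res$parameter)
print(res$p.value)
```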
##
## Calls:
## m1: lm(formula = as.numeric(quality) ~ alcohol, data = training_data)
## m2: lm(formula = as.numeric(quality) ~ alcohol + sulphates, data = training_data)
## m3: lm(formula = as.numeric(quality) ~ alcohol + sulphates + volatile.acidity,
## data = training_data)
## m4: lm(formula = as.numeric(quality) ~ alcohol + sulphates + volatile.acidity +
## citric.acid, data = training_data)
## m5: lm(formula = as.numeric(quality) ~ alcohol + sulphates + volatile.acidity +
## citric.acid + fixed.acidity, data = training_data)
## m6: lm(formula = as.numeric(quality) ~ alcohol + sulphates + pH,
## data = training_data)
##
## ===========================================================================================
##                    m1          m2          m3          m4          m5          m6
## -------------------------------------------------------------------------------------------
## (Intercept)        0.070       -0.462*     0.785**     0.761**     0.272       2.101***
##                    (0.225)     (0.228)     (0.252)     (0.260)     (0.290)     (0.515)
## alcohol            0.343***    0.322***    0.288***    0.288***    0.302***    0.349***
##                    (0.021)     (0.021)     (0.020)     (0.020)     (0.020)     (0.021)
## sulphates                      1.149***    0.776***    0.767***    0.759***    0.979***
##                                (0.144)     (0.142)     (0.144)     (0.143)     (0.145)
## volatile.acidity                           -1.242***   -1.212***   -1.314***
##                                            (0.128)     (0.148)     (0.149)
## citric.acid                                            0.054       -0.402*
##                                                        (0.135)     (0.182)
## fixed.acidity                                                      0.063***
##                                                                    (0.017)
## pH                                                                             -0.827***
##                                                                                (0.150)
## -------------------------------------------------------------------------------------------
## R-squared          0.211       0.261       0.327       0.328       0.337       0.284
## adj. R-squared     0.211       0.259       0.325       0.325       0.334       0.281
## sigma              0.722       0.700       0.668       0.668       0.663       0.689
## F                  256.479     168.592     154.970     116.165     96.922      126.040
## p                  0.000       0.000       0.000       0.000       0.000       0.000
## Log-likelihood     -1047.617   -1016.611   -971.283    -971.203    -964.337    -1001.526
## Deviance           499.109     467.857     425.655     425.585     419.534     453.367
## AIC                2101.233    2041.222    1952.565    1954.407    1942.675    2013.052
## BIC                2115.831    2060.686    1976.895    1983.602    1976.736    2037.381
## N                  959         959         959         959         959         959
## ===========================================================================================
Higher alcohol content (below 13%) combined with higher sulphate content is associated with better wines.
- The low R-squared values (0.337 at best, for m5) suggest that information needed to predict quality accurately is missing.
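One detail worth keeping in mind when reading the intercepts: the model calls use `as.numeric(quality)`, and if quality is the ordered factor shown in the structure output, `as.numeric()` returns the level codes (1 to 6) rather than the original scores (3 to 8). The sketch below illustrates the difference on a short made-up vector; whether the models above were affected depends on how quality was stored at fitting time, which the output does not show.

```r
# Ordered factor with the same six levels as quality in the str() output.
quality <- factor(c(5, 6, 7, 3, 8), levels = 3:8, ordered = TRUE)

# as.numeric() on a factor gives the internal level codes.
codes <- as.numeric(quality)

# Converting through character recovers the original scores.
scores <- as.numeric(as.character(quality))

print(codes)
print(scores)
```

Because the codes are the scores shifted by a constant, slopes and R-squared are unchanged; only the intercept moves by 2.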
Based on the sample data, wine quality tends to increase with alcohol percentage up to about 13%. Above that, the correlation turns slightly negative (r = -0.09), although with only 23 wines it is not statistically significant (p = 0.69).
As shown in the graph above, higher sulphate content and higher alcohol content (below 13%) are associated with better wine quality.
The best linear model (m5) explains about 34% of the variance in quality (R-squared = 0.337).
The red wine dataset contains 1,599 observations of 13 variables.
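As a sketch of how the R-squared figures are obtained, the code below fits two nested linear models and reads `r.squared` from their summaries. It uses only the first ten rows listed in the structure output, so the values will not match the table, which was fitted on 959 training rows.

```r
# First ten rows, copied from the str() output above.
wine10 <- data.frame(
  alcohol   = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  sulphates = c(0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47, 0.57, 0.8),
  quality   = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# Nested models mirroring the first two formulas in the table.
fit1 <- lm(quality ~ alcohol, data = wine10)
fit2 <- lm(quality ~ alcohol + sulphates, data = wine10)

# Share of variance in quality explained by each model.
r2_fit1 <- summary(fit1)$r.squared
r2_fit2 <- summary(fit2)$r.squared

print(c(fit1 = r2_fit1, fit2 = r2_fit2))
```

Adding a predictor never lowers plain R-squared, which is why the table also reports adjusted R-squared, AIC, and BIC.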
For future exploration of this data, I believe that additional information would add value to the analysis. I would also take one wine grade at a time (A, B, or C) and look at the patterns that appear within each of these three categories.